NETWORK VIDEO METHOD 



CROSS-REFERENCE TO RELATED APPLICATIONS 

This application claims priority from provisional application Serial No. 
60/214,457, filed 06/30/00. 

BACKGROUND OF THE INVENTION 

The invention relates to electronic devices, and more particularly to video 
coding, transmission, and decoding/synthesis methods and circuitry. 

The performance of real-time digital video systems using network
transmission, such as mobile video conferencing, has become increasingly
important with current and foreseeable digital communications. Both dedicated
channel and packetized-over-network transmissions benefit from compression of
video signals. The widely used motion compensation compression of video in
H.263 and MPEG uses I-frames (intra frames), which are separately coded, and
P-frames (predicted frames), which are coded as motion vectors for macroblocks
of a prior frame plus the residual difference between the motion-vector-predicted
macroblocks and the actual macroblocks.

Real-time video transmission over the Internet is usually done using the 
Real-time Transport Protocol (RTP). RTP sits on top of the User Datagram 
Protocol (UDP). The UDP is an unreliable protocol which does not guarantee the 
delivery of all the transmitted packets. Packet loss has an adverse impact on the 
quality of the video reconstructed at the receiver. Hence, error resilience 
techniques have to be adopted to mitigate the effect of packet losses. A
common heuristic technique used is the frequent periodic transmission of
I-frames in order to stop the propagation of errors by P-frames. That is, the motion
compensation is adjusted to increase the number of I-frames and
correspondingly decrease the number of P-frames.

However, this reduces the transmission rate because I-frame encoding
requires many more bits than P-frame encoding.

SUMMARY OF THE INVENTION

The present invention provides a method of motion compensated video for
transmission over a packetized network which trades off repeated transmission
of P-frames against the I-frame rate.

This has advantages including improved performance. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 illustrates a preferred embodiment Markov chain model. 
Figure 2 is a functional block diagram of a preferred embodiment encoder. 
Figures 3a-3d and 4a-4d show experimental results. 
Figure 5 illustrates a system. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

1. Overview

Preferred embodiment encoders and methods for motion compensated
video transmission over a packetized network are illustrated generally in
functional block form in Figure 2. The preferred embodiments apply a Markov
chain model (illustrated in Figure 1) to control motion compensation compression
by determining the rate of I-frames: a lower I-frame rate allows for repeated
transmissions of P-frames as a forward error correction (FEC) method. This
contrasts with the approach of increasing the I-frame rate and not repeating
P-frames. In particular, the preferred embodiments maximize the probability of
error-free reconstruction of frames as a function of the rate of I-frame
transmission; a lower I-frame transmission rate allows for repeated transmissions
of P-frames and thus increased probability of error-free reception of P-frames.

2. First preferred embodiments 

Figure 1 shows a Markov model for a first preferred embodiment system
having two states: S_0, the state when the current video frame reconstruction has
no errors, and S_1, the state when the current video frame reconstruction has at
least one error. The probabilities are as follows: q_0 is the probability a
transmitted frame is an I-frame and q_1 = 1 - q_0 is the probability a transmitted
frame is a P-frame; B-frames are ignored for this analysis. The probability a
transmitted I-frame is lost is p_{e0} and the probability a transmitted P-frame is lost is
p_{e1}. Thus Figure 1 shows remaining in state S_0 with probability q_0(1 - p_{e0}) +
q_1(1 - p_{e1}), which simply is the probability that an I-frame was transmitted and not
lost plus the probability that a P-frame was transmitted and not lost. Similarly,
the system remains in state S_1 with probability 1 - q_0(1 - p_{e0}), which simply states
that the only way to avoid a reconstruction error for a frame following an
erroneous reconstructed frame is to receive (not lost) a transmitted I-frame
because errors propagate in P-frames. Thus q_0(1 - p_{e0}) also is the probability for
transition from state S_1 to state S_0. Conversely, the probability of transition from
state S_0 to state S_1 is just the probability of losing the next frame, which is simply
q_0 p_{e0} + q_1 p_{e1}; that is, 1 minus the probability of remaining in state S_0. Thus the
overall probability of being in state S_0 is q_0(1 - p_{e0})/(q_0 + q_1 p_{e1}), which is just the
probability of an S_1 to S_0 transition divided by the sum of the probabilities of a
state transition. Note that q_0 is equal to the reciprocal of the period (in frames)
between I-frames; that is, if every nth frame is an I-frame, then the probability of
a transmitted I-frame is 1/n.
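
As an illustration only, the two-state model above can be checked numerically. The following minimal Python sketch evaluates the closed-form probability of the error-free state and compares it with the value obtained by iterating the chain directly. The function names are hypothetical and are not part of the described embodiment; the sample loss figures (0.271 for an I-frame, 0.1 for a P-frame) are assumed inputs corresponding to a 10% packet loss rate, a three-packet I-frame, and no P-frame repetition.

```python
def stationary_error_free(q0: float, pe0: float, pe1: float) -> float:
    """Closed-form Pr(S_0) = q0(1 - pe0) / (q0 + q1*pe1) from the text."""
    q1 = 1.0 - q0
    return q0 * (1.0 - pe0) / (q0 + q1 * pe1)


def stationary_by_iteration(q0: float, pe0: float, pe1: float,
                            steps: int = 10_000) -> float:
    """Iterate the two-state chain, starting error-free, until it settles."""
    q1 = 1.0 - q0
    leave_s0 = q0 * pe0 + q1 * pe1   # S_0 -> S_1: the next frame is lost
    enter_s0 = q0 * (1.0 - pe0)      # S_1 -> S_0: an I-frame is received
    p0 = 1.0                         # probability of being in S_0
    for _ in range(steps):
        p0 = p0 * (1.0 - leave_s0) + (1.0 - p0) * enter_s0
    return p0


if __name__ == "__main__":
    # Assumed inputs: every 6th frame is an I-frame, no P-frame repetition.
    closed = stationary_error_free(1 / 6, 0.271, 0.1)
    iterated = stationary_by_iteration(1 / 6, 0.271, 0.1)
    print(f"closed form: {closed:.4f}, iterated: {iterated:.4f}")
```

For these assumed inputs both routes give approximately 0.486, which suggests the closed form is the stationary distribution of the chain in Figure 1.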

Each transmitted packet over the Internet consists of compressed video 
data, an RTP header, and a UDP/IP header. Let v denote the number of bits in a 
packet header. For RTP/UDP/IP-based systems, v = 320. Because of this huge 
packet overhead, it is better to transmit as many source bits as possible in a 
single packet. The total size of the packet is limited by the maximum 
transmission unit (MTU) of the packet network. For Ethernet, the MTU is about 
1500 bytes. Current Internet video applications use relatively low bitrates; and at 
low bitrates multiple P-frames can be fit into a single packet. A problem with 
transmitting multiple P-frames in a single packet is that the effect of packet loss 
becomes very severe because loss of a single packet leads to the loss of 
multiple P-frames. Hence, only one P-frame is transmitted in a packet. With an
MTU of 1500 bytes, I-frames, however, do not fit into a single packet and have to
be split across multiple packets. For ease of description, let:

I_0 denote the average size of an I-frame expressed in bits.

I_1 denote the average size of a P-frame in bits.

n_I denote the number of packets required for a single I-frame.

k_0 denote the total number of bits (compressed bitstream plus header bits)
used to transmit an I-frame, so k_0 = I_0 + n_I v where v is the packet header size in
bits.

k_1 denote the total number of bits used to transmit a P-frame.

R_T denote the maximum transmission bit rate allowed.

q_{f1} denote the number of times each P-frame is retransmitted.

Presume a constant frame rate of f frames per second. Then the bit rate of the
source, R_S, can be expressed as R_S = q_0 f k_0 + q_1 f k_1 and the forward error
correction bit rate, R_F, which adds q_{f1} retransmissions of each P-frame, is R_F =
q_1 q_{f1} f k_1 with q_{f1} nonnegative. Thus the total transmission rate, R, is

R = R_S + R_F = q_0 f k_0 + q_1 f k_1 + q_1 q_{f1} f k_1
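
The rate expressions above can be sketched in a few lines of Python. This is an illustrative sketch, not the embodiment's implementation. The function and parameter names are hypothetical, and the P-frame packet size is assumed to be the P-frame bits plus one packet header, which follows from sending one P-frame per packet but is not stated explicitly in the text.

```python
V_HEADER_BITS = 320  # RTP/UDP/IP header bits per packet (v in the text)


def transmission_rates(q0, qf1, f, i0_bits, i1_bits, n_i, v=V_HEADER_BITS):
    """Return (R_S, R_F, R) in bits per second.

    q0      -- probability that a transmitted frame is an I-frame
    qf1     -- number of retransmissions of each P-frame (nonnegative)
    f       -- frame rate in frames per second
    i0_bits -- average I-frame size in bits
    i1_bits -- average P-frame size in bits
    n_i     -- number of packets per I-frame
    """
    if qf1 < 0:
        raise ValueError("qf1 must be nonnegative")
    q1 = 1.0 - q0
    k0 = i0_bits + n_i * v   # I-frame bits plus one header per packet
    k1 = i1_bits + v         # assumption: one P-frame per packet
    r_s = q0 * f * k0 + q1 * f * k1   # source rate
    r_f = q1 * qf1 * f * k1           # FEC (repeated P-frame) rate
    return r_s, r_f, r_s + r_f


if __name__ == "__main__":
    # "Mother and Daughter" figures from the experimental section, q0 = 1/6.
    r_s, r_f, r = transmission_rates(1 / 6, 0.0, 10, 18010, 2467, 3)
    print(f"R_S = {r_s:.0f} b/s, R_F = {r_f:.0f} b/s, R = {r:.0f} b/s")
```

With the "Mother and Daughter" frame sizes reported in the experimental section and no repetition, the sketch gives a source rate of about 54,842 bits/s, in line with the 54.84 kb/s quoted there.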

Let p_e be the packet loss rate (assumed to be random) encountered on
the Internet. Because only P-frames are retransmitted, the probability of loss of
an I-frame is given by

p_{e0} = 1 - (1 - p_e)^{n_I}

This just means that if any of the n_I packets containing a portion of an I-frame is
lost, then the entire I-frame is lost. Similarly, the probability of loss of a P-frame
is given by

p_{e1} = (1 - m_1) p_e^{1 + \lfloor q_{f1} \rfloor} + m_1 p_e^{1 + \lceil q_{f1} \rceil}

where \lfloor q_{f1} \rfloor is the largest integer not larger than q_{f1}, \lceil q_{f1} \rceil is the
smallest integer not smaller than q_{f1}, and m_1 is the fractional part of q_{f1}, that is,
m_1 = q_{f1} - \lfloor q_{f1} \rfloor.
Heuristically, if q_{f1} were an integer, then the probability of losing all 1 + q_{f1} packets
containing a P-frame would be the probability of losing the P-frame and so p_{e1} =
p_e^{1 + q_{f1}}. For noninteger q_{f1} the foregoing expression for p_{e1} is just the linear
interpolation between integer values bracketing q_{f1}.
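
The two loss probabilities can be written out as a short Python sketch. The function names are hypothetical, and the code only restates the expressions above: an I-frame is lost if any of its packets is lost, and a P-frame is lost only if every copy of it is lost, with linear interpolation for a noninteger number of retransmissions.

```python
import math


def iframe_loss(pe: float, n_i: int) -> float:
    """p_e0: an I-frame is lost if any of its n_i packets is lost."""
    return 1.0 - (1.0 - pe) ** n_i


def pframe_loss(pe: float, qf1: float) -> float:
    """p_e1: all 1 + qf1 copies lost, interpolated for noninteger qf1."""
    if qf1 < 0:
        raise ValueError("qf1 must be nonnegative")
    lo = math.floor(qf1)   # largest integer not larger than qf1
    hi = math.ceil(qf1)    # smallest integer not smaller than qf1
    m1 = qf1 - lo          # fractional part of qf1
    return (1.0 - m1) * pe ** (1 + lo) + m1 * pe ** (1 + hi)


if __name__ == "__main__":
    pe = 0.1
    print(f"p_e0 with 3 packets: {iframe_loss(pe, 3):.3f}")
    for qf1 in (0, 0.5, 1):
        print(f"p_e1 with qf1 = {qf1}: {pframe_loss(pe, qf1):.4f}")
```

For an integer q_{f1} the floor and ceiling coincide and m_1 is zero, so the expression reduces to p_e^{1 + q_{f1}}, as the heuristic argument requires.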

The preferred embodiment FEC method then determines the rate of
I-frame and repeated P-frame transmissions which maximizes the probability of
being in state S_0 ( = q_0(1 - p_{e0})/(q_0 + q_1 p_{e1}) ) given the constraint that R \le R_T. Note
that for a given probability of I-frame transmission, q_0, the value of q_{f1}
immediately follows from taking the transmission rate R = q_0 f k_0 + q_1 f k_1 + q_1 q_{f1} f k_1
equal to the maximum transmission rate, R_T, because f, k_0, and k_1 are fixed
parameters of the system and q_1 = 1 - q_0. Further, note that periodic transmission
of I-frames implies q_0 is of the form 1/n where n is the period in frames between
two I-frames and is an integer. Thus just evaluate the constrained probability of
being in state S_0 for all reasonable values of n and pick the q_0 which maximizes
the probability.
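
The search just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation. The function name and signature are hypothetical, k_1 is assumed to be the P-frame bits plus one packet header, the header size defaults to 320 bits, and candidate periods whose source rate alone exceeds R_T are simply skipped.

```python
import math


def best_iframe_period(periods, pe, f, i0_bits, i1_bits, n_i, r_t, v=320):
    """Return (n, qf1, pr_s0) maximizing Pr(S_0) with R set equal to r_t.

    Returns None if no candidate period fits within r_t.
    Assumes k1 = i1_bits + v (one P-frame per packet).
    """
    k0 = i0_bits + n_i * v
    k1 = i1_bits + v
    pe0 = 1.0 - (1.0 - pe) ** n_i          # I-frame loss probability
    best = None
    for n in periods:
        if n < 2:
            continue                        # n = 1 leaves no P-frames to repeat
        q0 = 1.0 / n
        q1 = 1.0 - q0
        r_s = f * (q0 * k0 + q1 * k1)       # source rate for this period
        if r_s > r_t:
            continue                        # violates the rate constraint
        qf1 = (r_t - r_s) / (q1 * f * k1)   # spend the remaining rate on FEC
        lo, hi = math.floor(qf1), math.ceil(qf1)
        m1 = qf1 - lo
        pe1 = (1.0 - m1) * pe ** (1 + lo) + m1 * pe ** (1 + hi)
        pr_s0 = q0 * (1.0 - pe0) / (q0 + q1 * pe1)
        if best is None or pr_s0 > best[2]:
            best = (n, qf1, pr_s0)
    return best


if __name__ == "__main__":
    # "Akiyo" figures from the experimental section; R_T taken as 52,890 b/s.
    n, qf1, pr_s0 = best_iframe_period(range(6, 21, 2), 0.1, 10,
                                       20475, 1711, 3, 52890)
    print(f"n = {n}, qf1 = {qf1:.3f}, Pr(S_0) = {pr_s0:.3f}")
```

With the "Akiyo" parameters reported in the experimental section and even periods from 6 to 20, this sketch selects n = 14, that is q_0 = 1/14, with q_{f1} just under 1.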

3. Experimental results 

Two common test video sequences, "Akiyo" and "Mother and Daughter",
were used to evaluate the foregoing preferred embodiment method using the
Markov model. The channel packet loss rate is assumed to be p_e = 10%.
Whenever a frame or portion of a frame (in the case of an I-frame) is not received
at the receiver, the evaluation simply copied the corresponding picture data from
the previous frame. Note that because a large amount of data is lost with each
packet loss, many of the more complicated error concealment techniques do not
provide improved performance. The evaluation used two metrics: (i) average
peak signal to noise ratio (PSNR) and (ii) fraction of frames reconstructed at the
receiver that have a PSNR distortion of less than a threshold; the PSNR was
obtained by averaging PSNR over 100 runs of transmitting the video bitstreams
over a simulated packet loss channel, and the fraction of frames reconstructed
for a distortion threshold t is denoted d_t.
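
The two metrics might be computed along the following lines. This Python sketch rests on an assumption about how the distortion is measured: a frame counts toward d_t when its PSNR at the receiver falls short of the PSNR of the same frame decoded without packet loss by less than t dB. The text does not spell out this definition, and the function names and the sample values are hypothetical.

```python
import math


def psnr_db(mse: float, peak: float = 255.0) -> float:
    """PSNR in dB for a given mean squared error (8-bit samples by default)."""
    if mse <= 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def fraction_within_threshold(reference_psnr, received_psnr, t_db):
    """d_t under the stated assumption: share of frames with PSNR drop < t_db."""
    if not reference_psnr or len(reference_psnr) != len(received_psnr):
        raise ValueError("need two nonempty PSNR lists of equal length")
    good = sum(1 for ref, rx in zip(reference_psnr, received_psnr)
               if ref - rx < t_db)
    return good / len(reference_psnr)


if __name__ == "__main__":
    # Made-up per-frame PSNR values, for illustration only.
    loss_free = [35.0, 35.0, 35.0, 35.0]
    received = [35.0, 34.7, 33.0, 34.2]
    print(fraction_within_threshold(loss_free, received, 1.0))
```

Averaging over repeated simulated channel runs, as in the evaluation, would amount to applying these functions to each run and taking the mean.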

The maximum total bitrate, R_T, was taken to be about 50 kb/s; and the
quantization parameter was taken to be 8 for compressing the video sequences.
For both video sequences, q_0 = 1/6 results in a bitrate around 50-55 kb/s at f = 10
frames/s; hence, the set of q_0 values used was q_0 = 1/6, 1/8, ..., 1/20. Note that the
source bitrate decreases as q_0 decreases. In the range q_0 = 1/6 to 1/20, q_0 = 1/6
corresponds to the case of maximum rate of transmission of I-frames. For each
of the video sequences, eight bitstreams were generated, one for each value of
q_0. Frame lengths I_0 and I_1 used for the Markov chain analysis were obtained by
averaging the I-frame and P-frame lengths, respectively, of the compressed
bitstreams; and n_I = 3 was used based on the I-frame size and MTU
consideration.

For "Akiyo" the following list summarizes the parameters used for the 
Markov chain model: 

p_e = 0.1
f = 10 frames/s
average size of I-frame, I_0 = 20,475 bits
average size of P-frame, I_1 = 1,711 bits
R_T = 52.89 kb/s
n_I = 3
q_0 in set 1/6, 1/8, ..., 1/20

Figure 3a shows the resulting Pr(S_0), the probability of being in state S_0, Figure
3b shows the average PSNR for various values of q_0, and Figure 3c shows the
resulting fraction of reconstructed frames with distortion less than threshold, d_t.
To obtain Figures 3b and 3c, the P-frame retransmission rate, q_{f1}, derived from
the Markov chain analysis was manually tweaked so that the total bitrate (source
rate + FEC rate) was very near to the source bitrate (also the total bitrate) for q_0 =
1/6. This was done to provide a fair comparison of results. Figure 3d shows the
resulting total bitrate. In Figure 3d, R_S denotes the source rate, R_F denotes the
rate used by the FEC, and R_T denotes the total bitrate.

As can be seen from Figure 3a, the Markov chain model predicts that to
obtain improved performance it makes sense to decrease the frequency of
I-frames (from q_0 = 1/6 to q_0 = 1/14 ... 1/20) and to instead use retransmission of
P-frames. Figures 3b and 3c support this claim. There is an improvement in
average PSNR in the range of 0.4-0.55 dB, and the fraction of reconstructed frames
which have reconstruction errors less than t, with t = 0.5, 1.0, 1.5 dB, goes up by
about 0.15-0.2. The d_t curve of Figure 3c implies that there are about 20-25%
more "good" frames when retransmission of P-frames is used instead of
increasing the frequency of I-frame transmission.

For "Mother and Daughter" the following list summarizes the parameters 
used for the Markov chain model: 

p_e = 0.1
f = 10 frames/s
average size of I-frame, I_0 = 18,010 bits
average size of P-frame, I_1 = 2,467 bits
R_T = 54.84 kb/s
n_I = 3
q_0 in set 1/6, 1/8, ..., 1/20

Figure 4a shows the resulting Pr(S_0), Figure 4b shows the average PSNR for
various values of q_0, and Figure 4c shows the resulting d_t. To obtain Figures 4b
and 4c, the P-frame retransmission rate, q_{f1}, derived from the Markov chain
analysis again was manually tweaked so that the total bitrate was very near to
the source bitrate (also the total bitrate) for q_0 = 1/6. This was done to provide a
fair comparison of results. Figure 4d shows the resulting total bitrate. In Figure
4d, R_S denotes the source rate, R_F denotes the rate used by the FEC, and R_T
denotes the total bitrate.

The Markov chain analysis in this case predicts that a gain in performance
cannot be achieved by decreasing the frequency of I-frames; see Figure 4a. The
PSNR and the d_t curves of Figures 4b and 4c support this claim. The PSNR and
the d_t curves remain more or less flat. Note that the PSNR and the d_t curves do
not move down like the Pr(S_0) curve of Figure 4a. This can be attributed to the
fact that the Markov chain model is a very simplistic model and is not based on
the PSNR metric. More complex models can be thought of for modeling the
PSNR performance, but they become complicated because of the use of motion
compensation in the decoder.

4. System preferred embodiments 

Figure 5 shows in functional block form a portion of a preferred 
embodiment system which uses a preferred embodiment motion-compensated 
video transmission method. Such systems include video phone communication 
over the Internet with wireless links at the ends and voice packets interspersed 
with the video packets; a two-way communication version would have the 
structure of Figure 5 for both directions. In preferred embodiment
communication systems, user (transmitter and/or receiver) hardware could include one
or more digital signal processors (DSPs) and/or other programmable devices
such as RISC processors with stored programs for performance of the signal
processing of a preferred embodiment method. Alternatively, specialized
circuitry (ASICs) could be used with (partially) hardwired preferred embodiment
methods. Users may also contain analog and/or mixed-signal integrated circuits
for amplification or filtering of inputs to or outputs from a communications channel
and for conversion between analog and digital. Such analog and digital circuits
may be integrated on a single die. The stored programs, including codebooks,
may, for example, be in ROM or flash EEPROM or FeRAM which is integrated
with the processor or external to the processor. Antennas may be parts of
receivers with multiple finger RAKE detectors for air interface to networks such 
as the Internet. Exemplary DSP cores could be in the TMS320C6xxx and 
TMS320C5xxx families from Texas Instruments. 

5. Modifications 

The preferred embodiments may be modified in various ways while
retaining one or more of the features of optimization of I-frame rate in view of
repeated P-frame transmission possibilities.

For example, the predictively-coded frames could include B-frames; the
frame playout could include a large buffer and delay to allow for some
automatic repeat request for I-frame packets to supersede some repeat P-frame
packets; the network protocols could differ.

Indeed, one can introduce the concept of using multiple servers to serve 
the same video receiving client. For example, presume the use of two video 
servers to serve the same client. This situation has two network channels 
feeding into the video client. Use one channel to transmit the I-frames and
P-frames (without repetition) and then use the other channel to transmit the FEC
P-frames. Note that the rate of video received at the client is the same as when a
single server is used. Use of two channels improves the performance, because 
the probability of both the channels deteriorating at the same time decreases. 